Deep Learning — Assignment 12 — Score sheet¶
Twelfth assignment for the 2024 Deep Learning course (NWI-IMC070) of the Radboud University.
Please enter your names and group number.
Names:
Group:
Objectives¶
In this assignment you will
- Build a diffusion model
- Extend the model to a class-conditional version
Required software¶
As before you will need these libraries:
torch and torchvision for PyTorch.
All libraries can be installed with pip install.
%matplotlib inline
import itertools
import numpy as np
import matplotlib.pyplot as plt
import torch
import time
from torch import nn
from torch.nn import functional as F
from torchvision import datasets, transforms
from IPython import display
# fix the seed, so outputs are exactly reproducible
torch.manual_seed(12345);
# Use the GPU if available
def detect_device():
if torch.cuda.is_available():
return torch.device("cuda")
elif torch.backends.mps.is_available():
return torch.device("mps")
else:
return torch.device("cpu")
device = detect_device()
12.1 MNIST dataset¶
In this assignment we will once again be using the MNIST digit dataset. This dataset consists of 28×28 grayscale images and has 60000 training examples divided over 10 classes. We split this into 55000 images for training and 5000 images for validation.
As preprocessing, we pad the images to 32×32 pixels and scale the intensities to [-1, +1].
(a) Run the code below to load the MNIST dataset.
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Pad(2), # pad to 32x32 pixels
transforms.Normalize(0.5, 0.5), # normalize to [-1, +1]
])
train_val_data = datasets.MNIST('data', train=True, download=True, transform=transform)
# Split into train and validation set
train_data, val_data = torch.utils.data.random_split(train_val_data, [55000, 5000])
# Create data loaders
train_loader = torch.utils.data.DataLoader(train_data, batch_size=128, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=1000)
data_loaders = {
'train': train_loader,
'val': val_loader,
}
12.2 Training images (4 points)¶
We will implement a model from the paper Denoising Diffusion Probabilistic Models by Ho et al., 2020.
We reuse some parameter settings from the paper:
diffusion_steps = 1000
beta = torch.linspace(1e-4, 0.02, diffusion_steps)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)
Using these settings, we can generate a sequence of noisy images with \begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} & \text{ where } \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \end{aligned}
There is a closed-form solution to compute $\mathbf{x}_t$ directly from $\mathbf{x}_0$ (see the paper or this blog): \begin{aligned} \mathbf{x}_t &= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon} & \text{ where } \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \end{aligned}
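To see why the closed form holds, note that iterating the one-step update scales the signal by $\sqrt{\alpha_t}$ at every step, so after $t$ steps the coefficient of $\mathbf{x}_0$ is $\prod_s \sqrt{\alpha_s} = \sqrt{\bar{\alpha}_t}$. A quick standalone sanity check of this, using the same schedule as above:

```python
import torch

diffusion_steps = 1000
beta = torch.linspace(1e-4, 0.02, diffusion_steps)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

# Iterating the per-step scaling sqrt(alpha_s) for s = 0..t
# reproduces the closed-form coefficient sqrt(alpha_bar[t]).
t = 500
coef = torch.tensor(1.0)
for s in range(t + 1):
    coef = coef * torch.sqrt(alpha[s])
assert torch.allclose(coef, torch.sqrt(alpha_bar[t]), atol=1e-5)

# At the final step almost no signal remains: x_T is essentially pure noise.
print(alpha_bar[-1].item() < 1e-4)  # True
```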
(a) Implement this closed-form solution in the code below to generate random images. (1 point)
Expected output: you should see some images with recognizable shapes and some images with noise.
Grading:
- 1 point for a correct implementation.
- 0 points if the scaling of x_0 or the noise is incorrect.
# sample some original images
x_0, y = next(iter(train_loader))
plt.figure(figsize=(19, 21))
for i, t in enumerate(range(0, diffusion_steps, 100)):
### BEGIN ANSWER
# generate random noise
noise = torch.randn_like(x_0)
# add noise to the original image to compute x_t
x_t = torch.sqrt(alpha_bar[t]) * x_0 + torch.sqrt(1 - alpha_bar[t]) * noise
### END ANSWER
plt.subplot(3, diffusion_steps // 100, i + 1)
plt.imshow(x_t[0, 0], cmap='gray')
plt.axis('off')
plt.title(f't={t}')
plt.tight_layout()
(b) Describe how we should interpret these images, and how they can be used to train the diffusion model. (2 points)
Answer:
Starting from the original image at t=0, the image at step t+1 is computed by adding noise to the image at step t. This is repeated until the final image contains only random noise.
The diffusion model is trained to do this process in reverse: given a noisy image at step t+1, it is trained to predict the noise that was added to the image at step t.
Grading:
- 1 point for describing the forward process (image to noise).
- 1 point for describing the reverse process (noise to image).
During training, we will need a minibatch with multiple images and multiple time steps.
(c) Complete the function below to add noise to a minibatch of images. (1 point)
Grading:
- 1 point for a correct implementation.
def generate_noisy_samples(x_0, beta):
'''
Create noisy samples for the minibatch x_0.
Return the noisy image, the noise, and the time for each sample.
'''
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)
# sample a random time t for each sample in the minibatch
t = torch.randint(beta.shape[0], size=(x_0.shape[0],), device=x_0.device)
### BEGIN ANSWER
# generate random noise
noise = torch.randn_like(x_0)
# compute the noisy version for each sample
x_t = torch.sqrt(alpha_bar[t, None, None, None]) * x_0 + \
torch.sqrt(1 - alpha_bar[t, None, None, None]) * noise
### END ANSWER
return x_t, noise, t
(d) Try out your new function by generating a few noisy samples.
Expected output: you should see the samples in the minibatch with different levels of noise, depending on the time $t$ for each sample.
x_0, y = next(iter(train_loader))
x_t, noise, sampled_t = generate_noisy_samples(x_0, beta)
assert x_t.shape == x_0.shape
assert noise.shape == x_0.shape
assert sampled_t.shape == (x_0.shape[0],)
plt.figure(figsize=(12, 6))
for i in range(32):
plt.subplot(4, 8, i + 1)
plt.imshow(x_t[i, 0], cmap='gray')
plt.axis('off')
plt.title(f't={sampled_t[i]}')
plt.tight_layout()
12.3 Helper functions¶
We will use some predefined components to construct our model, based on an existing implementation on GitHub.
class SelfAttention(nn.Module):
def __init__(self, h_size):
super(SelfAttention, self).__init__()
self.h_size = h_size
self.mha = nn.MultiheadAttention(h_size, 4, batch_first=True)
self.ln = nn.LayerNorm([h_size])
self.ff_self = nn.Sequential(
nn.LayerNorm([h_size]),
nn.Linear(h_size, h_size),
nn.GELU(),
nn.Linear(h_size, h_size),
)
def forward(self, x):
x_ln = self.ln(x)
attention_value, _ = self.mha(x_ln, x_ln, x_ln)
attention_value = attention_value + x
attention_value = self.ff_self(attention_value) + attention_value
return attention_value
class SAWrapper(nn.Module):
def __init__(self, h_size, num_s):
super(SAWrapper, self).__init__()
self.sa = nn.Sequential(*[SelfAttention(h_size) for _ in range(1)])
self.num_s = num_s
self.h_size = h_size
def forward(self, x):
x = x.view(-1, self.h_size, self.num_s * self.num_s).swapaxes(1, 2)
x = self.sa(x)
x = x.swapaxes(2, 1).view(-1, self.h_size, self.num_s, self.num_s)
return x
# U-Net code adapted from: https://github.com/milesial/Pytorch-UNet
class DoubleConv(nn.Module):
def __init__(self, in_channels, out_channels, mid_channels=None, residual=False):
super().__init__()
self.residual = residual
if not mid_channels:
mid_channels = out_channels
self.double_conv = nn.Sequential(
nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
nn.GroupNorm(1, mid_channels),
nn.GELU(),
nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1, bias=False),
nn.GroupNorm(1, out_channels),
)
def forward(self, x):
if self.residual:
return F.gelu(x + self.double_conv(x))
else:
return self.double_conv(x)
class Down(nn.Module):
def __init__(self, in_channels, out_channels):
super().__init__()
self.maxpool_conv = nn.Sequential(
nn.MaxPool2d(2),
DoubleConv(in_channels, in_channels, residual=True),
DoubleConv(in_channels, out_channels),
)
def forward(self, x):
return self.maxpool_conv(x)
class Up(nn.Module):
def __init__(self, in_channels, out_channels, bilinear=True):
super().__init__()
# if bilinear, use the normal convolutions to reduce the number of channels
if bilinear:
self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
self.conv = DoubleConv(in_channels, in_channels, residual=True)
self.conv2 = DoubleConv(in_channels, out_channels, in_channels // 2)
else:
self.up = nn.ConvTranspose2d(
in_channels, in_channels // 2, kernel_size=2, stride=2
)
self.conv = DoubleConv(in_channels, out_channels)
def forward(self, x1, x2):
x1 = self.up(x1)
# input is CHW
diffY = x2.size()[2] - x1.size()[2]
diffX = x2.size()[3] - x1.size()[3]
x1 = F.pad(x1, [diffX // 2, diffX - diffX // 2, diffY // 2, diffY - diffY // 2])
x = torch.cat([x2, x1], dim=1)
x = self.conv(x)
x = self.conv2(x)
return x
class OutConv(nn.Module):
def __init__(self, in_channels, out_channels):
super(OutConv, self).__init__()
self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
def forward(self, x):
return self.conv(x)
12.4 Diffusion model (5 points)¶
Similar to Ho et al. and to several online implementations, we will use a U-Net with self-attention and positional embedding as our diffusion model.
(a) Familiarize yourself with the architecture of this U-Net.
class UNet(nn.Module):
def __init__(self, c_in=1, c_out=1, width=16):
super().__init__()
self.width = width
bilinear = True
self.inc = DoubleConv(c_in, self.width)
self.down1 = Down(self.width, self.width*2)
self.down2 = Down(self.width*2, self.width*4)
factor = 2 if bilinear else 1
self.down3 = Down(self.width*4, self.width*8 // factor)
self.up1 = Up(self.width*8, self.width*4 // factor, bilinear)
self.up2 = Up(self.width*4, self.width*2 // factor, bilinear)
self.up3 = Up(self.width*2, self.width, bilinear)
self.outc = OutConv(self.width, c_out)
self.sa1 = SAWrapper(self.width*4, 8)
self.sa2 = SAWrapper(self.width*4, 4)
self.sa3 = SAWrapper(self.width*2, 8)
def pos_encoding(self, t, channels, embed_size):
inv_freq = 1.0 / (
10000
** (torch.arange(0, channels, 2, device=t.device).float() / channels)
)
pos_enc_a = torch.sin(t[:, None].repeat(1, channels // 2) * inv_freq)
pos_enc_b = torch.cos(t[:, None].repeat(1, channels // 2) * inv_freq)
pos_enc = torch.cat([pos_enc_a, pos_enc_b], dim=-1)
return pos_enc.view(-1, channels, 1, 1).repeat(1, 1, embed_size, embed_size)
def forward(self, x, t):
"""
Model is U-Net with added positional encodings and self-attention layers.
"""
device = x.device
x1 = self.inc(x)
x2 = self.down1(x1) + self.pos_encoding(t, self.width*2, 16)
x3 = self.down2(x2) + self.pos_encoding(t, self.width*4, 8)
x3 = self.sa1(x3)
x4 = self.down3(x3) + self.pos_encoding(t, self.width*4, 4)
x4 = self.sa2(x4)
x = self.up1(x4, x3) + self.pos_encoding(t, self.width*2, 8)
x = self.sa3(x)
x = self.up2(x, x2) + self.pos_encoding(t, self.width, 16)
x = self.up3(x, x1) + self.pos_encoding(t, self.width, 32)
output = self.outc(x)
return output
(b) What does the positional encoding encode? Why would this be useful? (2 points)
Answer:
The positional encoding encodes the time step t for the given sample.
The model needs the time step t to estimate the noise correctly: images in early steps (for small t) contain little noise, whereas images in later steps (large t) are very noisy. Providing t as an input allows the model to learn this.
Grading:
- 1 point for what it encodes.
- 1 point for why this is useful.
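As an illustration, the core of the sinusoidal construction used in pos_encoding (without the per-pixel repeat) can be sketched as follows; sinusoidal_encoding is a hypothetical helper name, not part of the assignment code:

```python
import torch

def sinusoidal_encoding(t, channels):
    # Same construction as UNet.pos_encoding, minus the spatial repeat:
    # sines and cosines at geometrically spaced frequencies.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, channels, 2).float() / channels))
    return torch.cat([torch.sin(t[:, None] * inv_freq),
                      torch.cos(t[:, None] * inv_freq)], dim=-1)

t = torch.tensor([0.0, 1.0, 500.0, 999.0])
enc = sinusoidal_encoding(t, channels=8)
print(enc.shape)  # torch.Size([4, 8])
# Every time step gets a distinct vector, so the model can tell a slightly
# noisy image (small t) from an almost pure-noise one (large t).
print(torch.equal(enc[0], enc[3]))  # False
```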
(c) Describe how this model will be used. What do the inputs and outputs represent? (3 points)
Answer:
Used for: Given an image at step t, the model predicts the noise that was added to the image at step t-1. This estimated noise is subtracted from the image at step t to obtain an estimate of the image at step t-1. The process is then repeated to estimate the image at step t-2, et cetera.
Inputs: The model receives the image at time t and the time step t that corresponds to this image.
Outputs: The model predicts the noise $\boldsymbol{\epsilon}_t$, which has the same shape as the image.
Grading:
- 1 point for each of: use of the model, inputs, and outputs.
12.5 Training the model (7 points)¶
We will train our diffusion model using Algorithm 1 from the paper Denoising Diffusion Probabilistic Models by Ho et al., 2020.
(a) The algorithm uses $\mathbf{x}_0$. How do you obtain $\mathbf{x}_0$ during training? (1 point)
Answer:
In Algorithm 1, $x_0 \sim q(\mathbf{x}_0)$ is sampled from the distribution $q$ of real images. In practice, this means that we take a random image from the training set.
Grading:
- 1 point for taking a random $x_0$ from the training set.
(b) Which two values are compared in the loss on line 5 of the algorithm? (1 point)
Answer:
The loss compares the actual random noise $\boldsymbol{\epsilon}_t$ against the noise predicted by the network from the noisy image, $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$.
Grading:
- 1 point for a correct answer.
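In code, line 5 of Algorithm 1 is just a mean-squared error between these two tensors. A standalone sketch with a stand-in prediction (no trained model involved):

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
noise = torch.randn(4, 1, 32, 32)                  # the epsilon that was actually added
predicted = noise + 0.1 * torch.randn_like(noise)  # stand-in for eps_theta(x_t, t)
loss = F.mse_loss(predicted, noise)                # compares prediction to true noise
print(loss.item() < 0.02)  # True: the stand-in error has variance ~0.01
```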
(c) Implement this procedure in the training loop below. (3 points)
Grading:
- The training and test steps should be very similar. Check the training steps first.
- 1 point for generating the samples.
- 1 point for running the model to estimate the noise.
- 1 point for computing the loss.
- Then: subtract 1 point if you made different mistakes in the test phase.
def train(model, data_loaders, beta, num_epochs=10, lr=1e-3, device=device):
"""Train a diffusion model"""
train_loader = data_loaders['train']
validation_loader = data_loaders['val']
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
plotter = Plotter(xlabel='epoch', xlim=[1, num_epochs],
legend=['train loss', 'validation loss'])
start_time = time.time()
for epoch in range(num_epochs):
model.train()
metrics = Metrics(1)
for x, y in train_loader:
x = x.to(device)
optimizer.zero_grad()
### BEGIN ANSWER
# generate a noisy minibatch
x_t, noise, sampled_t = generate_noisy_samples(x, beta.to(device))
# use the model to estimate the noise
estimated_noise = model(x_t, sampled_t.to(torch.float))
# compute the difference between the noise and the estimated noise
loss = F.mse_loss(estimated_noise, noise)
### END ANSWER
# Optimize
loss.backward()
optimizer.step()
# Track our progress
metrics.add(len(x), loss.item())
train_loss = metrics.mean()[0]
# Compute validation loss
validation_loss = evaluate(model, validation_loader, beta, device=device)
# Plot
plotter.add(epoch + 1, (train_loss, validation_loss))
train_time = time.time() - start_time
print(f'training loss {train_loss:.3g}, validation loss {validation_loss:.3g}')
print(f'{metrics.count * num_epochs / train_time:.1f} examples/sec '
f'on {str(device)}')
def evaluate(model, test_loader, beta, device=device):
"""Evaluate a diffusion model by computing the loss."""
with torch.no_grad():
model.eval()
metrics = Metrics(1)
for x, y in test_loader:
x = x.to(device)
### BEGIN ANSWER
x_t, noise, sampled_t = generate_noisy_samples(x, beta.to(device))
estimated_noise = model(x_t, sampled_t.to(torch.float))
loss = F.mse_loss(estimated_noise, noise)
### END ANSWER
metrics.add(len(x), loss.item())
return metrics.mean()[0]
class Metrics:
"""Accumulate mean values of one or more metrics."""
def __init__(self, n):
self.count = 0
self.sum = (0,) * n
def add(self, count, *values):
self.count += count
self.sum = tuple(s + count * v for s,v in zip(self.sum,values))
def mean(self):
return tuple(s / self.count for s in self.sum)
class Plotter:
"""For plotting data in animation."""
# Based on d2l.Animator
def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
ylim=None, xscale='linear', yscale='linear',
fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
figsize=(3.5, 2.5)):
# Incrementally plot multiple lines
if legend is None:
legend = []
self.fig, self.axes = plt.subplots(nrows, ncols, figsize=figsize)
if nrows * ncols == 1:
self.axes = [self.axes,]
# Use a function to capture arguments
def config_axes():
axis = self.axes[0]
axis.set_xlabel(xlabel), axis.set_ylabel(ylabel)
axis.set_xscale(xscale), axis.set_yscale(yscale)
axis.set_xlim(xlim), axis.set_ylim(ylim)
if legend:
axis.legend(legend)
axis.grid()
self.config_axes = config_axes
self.X, self.Y, self.fmts = None, None, fmts
def add(self, x, y):
# Add multiple data points into the figure
if not hasattr(y, "__len__"):
y = [y]
n = len(y)
if not hasattr(x, "__len__"):
x = [x] * n
if not self.X:
self.X = [[] for _ in range(n)]
if not self.Y:
self.Y = [[] for _ in range(n)]
for i, (a, b) in enumerate(zip(x, y)):
if a is not None and b is not None:
self.X[i].append(a)
self.Y[i].append(b)
self.axes[0].cla()
for x, y, fmt in zip(self.X, self.Y, self.fmts):
self.axes[0].plot(x, y, fmt)
self.config_axes()
display.display(self.fig)
display.clear_output(wait=True)
(d) How does the training time depend on the number of diffusion steps $T$? (1 point)
Answer:
Training time does not depend on $T$, because each training step always provides the model with the same two inputs: the noisy image at step $t$ (computed directly from $\mathbf{x}_0$ with the closed-form solution) and its noise target. No iteration over all $T$ steps is needed.
Grading:
- 1 point for saying that training time does not depend on $T$.
- It is also OK to suggest that changing $T$ might change the complexity of the task, requiring a longer/shorter training time, but your answer should at least mention that the number of inputs $t$ and $\epsilon_t$ does not depend on $T$.
(e) Train the model.
Expected output: in our implementation, the training loss started at around 0.1 and went down quickly to 0.03 and lower.
model = UNet().to(device)
train(model, data_loaders, beta, num_epochs=20, lr=1e-3)
training loss 0.0196, validation loss 0.0191
1840.5 examples/sec on cuda
(f) Has the training converged? Do you think we should train longer? (1 point)
Answer:
No. It looks like both the training and validation loss are still going down, so training has not fully converged. Training longer could give a better model.
Grading:
- 1 point for a correct answer.
12.6 Sampling from the model (9 points)¶
Once the model is trained, we can sample from it using Algorithm 2 from paper Denoising Diffusion Probabilistic Models:
Algorithm 2:
1: $\mathbf{x}_T \sim \mathcal{N}(\mathbf 0, \mathbf I)$
2: for $t = T, \dots, 1$ do
3: $\mathbf{z} \sim \mathcal{N}(\mathbf 0, \mathbf I)$ if $t > 1$, else $\mathbf{z} = \mathbf{0}$
4: $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta \left(\mathbf{x}_t, t\right)\right) + \sigma_t \mathbf{z}$
5: end for
6: return $\mathbf{x}_0$
(a) In step 3 of Algorithm 2, $\mathbf{z}$ is sometimes set to $\mathbf{0}$. What is the effect of this? (1 point)
Answer:
For most of the steps, new noise $\mathbf{z}$ is added to the denoised image $\mathbf{x}_{t-1}$. If $\mathbf{z}$ is set to $\mathbf{0}$, as it is in the final step, no new noise is added to the image, which means that the output of the algorithm is exactly the image minus the estimated noise.
Grading:
- 1 point for explaining that $\mathbf{z} = \mathbf{0}$ does not add new noise to the image.
(b) In step 4 of Algorithm 2, $\mathbf{x}_{t-1}$ is computed based on three ingredients: $\mathbf{x}_t$, $\boldsymbol{\epsilon}_\theta \left(\mathbf{x}_t, t\right)$, and $\mathbf{z}$. What do these represent? (2 points)
Answer:
- $\mathbf{x}_t$: The noisy image at step $t$.
- $\boldsymbol{\epsilon}_\theta \left(\mathbf{x}_t, t\right)$: The estimate of the noise at step $t$, computed by the model with parameters $\theta$ using $\mathbf{x}_t$ and $t$ as inputs.
- $\mathbf{z}$: New noise that is added to the image in every step except the last ($t > 1$ in the algorithm).
Grading:
- 2 points for three good descriptions.
- 1 point for two good descriptions.
(c) How does the sampling time depend on the number of diffusion steps $T$? (1 point)
Answer:
Sampling time scales linearly with $T$: each of the $T$ denoising steps requires one forward pass through the model.
Grading:
- 1 point for a correct answer.
(d) Complete the code below to sample a minibatch from the model. (2 points)
- Use the equations in Algorithm 2.
- Use $\sigma_t = \sqrt{\beta_t}$, as suggested in the paper.
- Keep in mind that Algorithm 2 uses $t=1$ as the first time step, whereas we use $t=0$.
Expected output: after training, your model should generate fairly realistic, clean images when given random inputs.
Grading:
- Max. 2 points for a correct implementation of sample_from_model and sampling an input $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
- -1 point for each mistake.
def sample_from_model(x, model, beta, device=device):
# keep track of x at different time steps
x_hist = []
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)
with torch.no_grad():
# loop over all time steps in reverse order
for i in reversed(range(0, beta.shape[0])):
# copy the time step for each sample in the minibatch
t = (torch.ones(x.shape[0]) * i).long().to(device)
### BEGIN ANSWER
# generate random noise for early time steps
z = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
# define sigma as suggested in the paper
sigma = torch.sqrt(beta[i])
# compute the next x
x = (1 / torch.sqrt(alpha[i])) * \
(x - ((1 - alpha[i]) / torch.sqrt(1 - alpha_bar[i])) * model(x, t)) + \
sigma * z
### END ANSWER
if i % 100 == 0:
x_hist.append(x.detach().cpu().numpy())
return x, x_hist
def plot_x_hist(x_hist):
# plot the generated images
plt.figure(figsize=(10, 10))
for i in range(len(x_hist)):
for j in range(10):
plt.subplot(10, 10, j * 10 + i + 1)
plt.imshow(x_hist[i][j, 0], cmap='gray')
plt.axis('off')
### BEGIN ANSWER
# start with the initial noise
# shape: [10, 1, 32, 32]
x = torch.randn_like(x_0[:10]).to(device)
### END ANSWER
x, x_hist = sample_from_model(x, model, beta)
plot_x_hist(x_hist)
(e) Explain the X and Y axes of this figure. (1 point)
Answer:
Each row shows the process for one sample. The columns show intermediate time steps: from almost random inputs on the left, to the final output on the right.
Grading:
- 1 point for a correct interpretation of the figure.
(f) In a variational autoencoder or a GAN, the output is determined by the latent representation. How does that work for this diffusion model? (1 point)
Answer:
There is no clear latent representation in the diffusion model. The output depends on the initial noise provided as the input $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, but also on the noise $\mathbf{z}$ that is added in each intermediate step.
Grading:
- 1 point for a correct answer.
(g) Look at the generated intermediate samples over time in question (d). Do we need all of the steps? Why/why not? (1 point)
Answer:
No, the images at t=500 still look completely random, so we could probably start from there, and not do the first iterations where not much denoising is happening. Or we could use a cosine noise schedule instead of a linear one.
Also, we are doing 1000 denoising steps, which is excessive. It is likely that fewer steps would also work.
Grading:
- 1 point for either of the two answers given above.
- 0 points for just 'no'.
12.7 Experiments (5 points)¶
Fixed initialization¶
How does the end result depend on the initialization? We will generate multiple images from the same initial noise to find out.
(a) Complete and run the code below.
Expected output: the model should still produce recognizable shapes.
### BEGIN ANSWER
# start with the initial noise for one image
# shape: [1, 1, 32, 32]
x = torch.randn_like(x_0[:1]).to(device)
### END ANSWER
# repeat this to generate 10 images from the same initialization
x = x.repeat(10, 1, 1, 1)
x, x_hist = sample_from_model(x, model, beta)
plot_x_hist(x_hist)
(b) Does the model always produce the same output from the same initial input? Why, or why not? (2 points)
Answer:
No, the model does not always generate the same output from a given input.
The output depends on the initial input, but also on the noise $\mathbf{z}$ that is added in each intermediate step.
Grading:
- 1 point for running the experiment correctly (see (a)) and observing that the outputs are different.
- 1 point for giving the intermediate noise as the reason.
No randomness between time steps¶
To check the influence of noise during sampling, we can remove the term $\sigma_t \mathbf{z}$ from Algorithm 2.
(c) Create a new function deterministic_sample_from_model, based on sample_from_model, that does not include this term. (2 points)
Grading:
- 2 points for a correct implementation.
def deterministic_sample_from_model(x, model, beta, device=device):
### BEGIN ANSWER
# keep track of x at different time steps
x_hist = []
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)
with torch.no_grad():
# loop over all time steps in reverse order
for i in reversed(range(0, beta.shape[0])):
# copy the time step for each sample in the minibatch
t = (torch.ones(x.shape[0]) * i).long().to(device)
# compute the next x
x = (1 / torch.sqrt(alpha[i])) * \
(x - ((1 - alpha[i]) / torch.sqrt(1 - alpha_bar[i])) * model(x, t))
# note: no sigma * z
if i % 100 == 0:
x_hist.append(x.detach().cpu().numpy())
return x, x_hist
### END ANSWER
(d) Generate some samples using the new function.
Expected output: you should get a different result than before.
### BEGIN ANSWER
# start with the initial noise
# shape: [10, 1, 32, 32]
x = torch.randn_like(x_0[:10]).to(device)
### END ANSWER
x, x_hist = deterministic_sample_from_model(x, model, beta)
plot_x_hist(x_hist)
(e) What can you conclude from these results? Is the random noise during sampling important? (1 point)
Answer:
The random noise during sampling is important. The model expects that the intermediate images contain some amount of noise, so providing noiseless images gives unexpected results.
Grading:
- 1 point for a correct answer.
12.8 Making the model conditional (6 points)¶
Similar to the conditional VAE in the previous assignment, we can make the diffusion model conditional by including class labels. This allows us to generate samples from a specific digit.
We will include the class information alongside the existing positional encoding, using a torch.nn.Embedding layer to map the 10 digits to a higher-dimensional space.
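A torch.nn.Embedding layer is simply a learnable lookup table: each class index selects one row of a weight matrix. A minimal standalone sketch (the dimension 64 matches the model's self.width*4 for width=16):

```python
import torch
from torch import nn

embedding = nn.Embedding(10, 64)   # 10 digit classes -> learnable 64-dim vectors
labels = torch.tensor([3, 3, 7])   # class labels for a minibatch
vecs = embedding(labels)
print(vecs.shape)                     # torch.Size([3, 64])
print(torch.equal(vecs[0], vecs[1]))  # True: same label, same vector
print(torch.equal(vecs[0], vecs[2]))  # False: different labels, different vectors
```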
Conditional model¶
(a) Study the implementation of UNetConditional to see how this works.
class UNetConditional(nn.Module):
def __init__(self, c_in=1, c_out=1, n_classes=10, width=16):
super().__init__()
self.width = width
bilinear = True
self.inc = DoubleConv(c_in, self.width)
self.down1 = Down(self.width, self.width*2)
self.down2 = Down(self.width*2, self.width*4)
factor = 2 if bilinear else 1
self.down3 = Down(self.width*4, self.width*8 // factor)
self.up1 = Up(self.width*8, self.width*4 // factor, bilinear)
self.up2 = Up(self.width*4, self.width*2 // factor, bilinear)
self.up3 = Up(self.width*2, self.width, bilinear)
self.outc = OutConv(self.width, c_out)
self.sa1 = SAWrapper(self.width*4, 8)
self.sa2 = SAWrapper(self.width*4, 4)
self.sa3 = SAWrapper(self.width*2, 8)
self.label_embedding = nn.Embedding(n_classes, self.width*4)
def pos_encoding(self, t, channels, embed_size):
inv_freq = 1.0 / (
10000
** (torch.arange(0, channels, 2, device=t.device).float() / channels)
)
pos_enc_a = torch.sin(t[:, None].repeat(1, channels // 2) * inv_freq)
pos_enc_b = torch.cos(t[:, None].repeat(1, channels // 2) * inv_freq)
pos_enc = torch.cat([pos_enc_a, pos_enc_b], dim=-1)
return pos_enc.view(-1, channels, 1, 1).repeat(1, 1, embed_size, embed_size)
def label_encoding(self, label, channels, embed_size):
return self.label_embedding(label)[:, :channels, None, None].repeat(1, 1, embed_size, embed_size)
def forward(self, x, t, label):
"""
Model is U-Net with added positional encodings and self-attention layers.
"""
x1 = self.inc(x)
x2 = self.down1(x1) + self.pos_encoding(t, self.width*2, 16) + self.label_encoding(label, self.width*2, 16)
x3 = self.down2(x2) + self.pos_encoding(t, self.width*4, 8) + self.label_encoding(label, self.width*4, 8)
x3 = self.sa1(x3)
x4 = self.down3(x3) + self.pos_encoding(t, self.width*4, 4) + self.label_encoding(label, self.width*4, 4)
x4 = self.sa2(x4)
x = self.up1(x4, x3) + self.pos_encoding(t, self.width*2, 8) + self.label_encoding(label, self.width*2, 8)
x = self.sa3(x)
x = self.up2(x, x2) + self.pos_encoding(t, self.width, 16) + self.label_encoding(label, self.width, 16)
x = self.up3(x, x1) + self.pos_encoding(t, self.width, 32) + self.label_encoding(label, self.width, 32)
output = self.outc(x)
return output
(b) As in the paper by Ho et al., the position and label encodings are added in every layer of the model, instead of as an input to the first layer only. Why do you think the authors made this choice? (1 point)
Answer:
The position and label encodings carry information that is essential for the denoising process. If they are only included as an input to the first layer, this information gets mixed with the image features and may be lost in deeper layers. Making the values directly available in every layer makes the model much easier to train.
Grading:
- 1 point for a correct answer.
Conditional training loop¶
(c) Create a new function train_conditional to train this model. (1 point)
Grading:
- 1 point for a correct implementation.
- The only difference compared to train is that y is passed to the model.
def train_conditional(model, data_loaders, beta, num_epochs=10, lr=1e-3, device=device):
### BEGIN ANSWER
"""Train a diffusion model"""
train_loader = data_loaders['train']
validation_loader = data_loaders['val']
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
plotter = Plotter(xlabel='epoch', xlim=[1, num_epochs],
legend=['train loss', 'validation loss'])
start_time = time.time()
for epoch in range(num_epochs):
model.train()
metrics = Metrics(1)
for x, y in train_loader:
x = x.to(device)
y = y.to(device)
optimizer.zero_grad()
# generate a noisy minibatch
x_t, noise, sampled_t = generate_noisy_samples(x, beta.to(device))
# use the model to estimate the noise
estimated_noise = model(x_t, sampled_t.to(torch.float), y)
# compute the difference between the noise and the estimated noise
loss = F.mse_loss(estimated_noise, noise)
# Optimize
loss.backward()
optimizer.step()
# Track our progress
metrics.add(len(x), loss.item())
train_loss = metrics.mean()[0]
# Compute validation loss
validation_loss = evaluate_conditional(model, validation_loader, beta)
# Plot
plotter.add(epoch + 1, (train_loss, validation_loss))
train_time = time.time() - start_time
print(f'training loss {train_loss:.3g}, validation loss {validation_loss:.3g}')
print(f'{metrics.count * num_epochs / train_time:.1f} examples/sec '
f'on {str(device)}')
### END ANSWER
def evaluate_conditional(model, test_loader, beta, device=device):
### BEGIN ANSWER
"""Evaluate a conditional diffusion model by computing the loss."""
with torch.no_grad():
model.eval()
metrics = Metrics(1)
for x, y in test_loader:
x = x.to(device)
y = y.to(device)
x_t, noise, sampled_t = generate_noisy_samples(x, beta.to(device))
estimated_noise = model(x_t, sampled_t.to(torch.float), y)
loss = F.mse_loss(estimated_noise, noise)
metrics.add(len(x), loss.item())
return metrics.mean()[0]
### END ANSWER
(d) Train the conditional model.
Expected output: in our implementation, the training loss started at around 0.1 and went down quickly to 0.03 and lower.
model_conditional = UNetConditional().to(device)
train_conditional(model_conditional, data_loaders, beta, num_epochs=20, lr=1e-3)
training loss 0.0193, validation loss 0.0195
1822.7 examples/sec on cuda
Conditional sampling¶
(e) Modify the sampling function to include a label. (1 point)
Grading:
- 1 point for a correct implementation.
def sample_from_model_conditional(x, model, beta, label, device=device):
### BEGIN ANSWER
# keep track of x at different time steps
x_hist = []
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)
with torch.no_grad():
c = (torch.ones(x.shape[0]) * label).long().to(device)
# loop over all time steps in reverse order
for i in reversed(range(0, beta.shape[0])):
# copy the time step for each sample in the minibatch
t = (torch.ones(x.shape[0]) * i).long().to(device)
# generate random noise for early time steps
z = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
# define sigma as suggested in the paper
sigma = torch.sqrt(beta[i])
# compute the next x
x = (1 / torch.sqrt(alpha[i])) * \
(x - ((1 - alpha[i]) / torch.sqrt(1 - alpha_bar[i])) * model(x, t, c)) + \
sigma * z
if i % 100 == 0:
x_hist.append(x.detach().cpu().numpy())
return x, x_hist
### END ANSWER
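For reference, each iteration of the loop above implements the DDPM reverse update suggested in the paper, here with the class label $c$ passed as an extra input to the noise estimator:

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) \right) + \sigma_t \mathbf{z}, \qquad \sigma_t = \sqrt{\beta_t}, \quad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

with $\mathbf{z} = \mathbf{0}$ at the final step $t = 0$.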
12.8 (f) Sample a few digits with label 3. (1 point)
Expected output: you should see recognizable images with the number you requested.
Grading:
- 1 point for a correct implementation.
### BEGIN ANSWER
x = torch.randn_like(x_0[:10]).to(device)
x, x_hist = sample_from_model_conditional(x, model_conditional, beta, label=3)
plot_x_hist(x_hist)
### END ANSWER
12.8 (g) Complete the code to sample and plot 10 samples for every digit. (1 point)
Grading:
- 1 point for a correct implementation.
x_per_class = []
for label in range(10):
# sample 10 digits with this label
### BEGIN ANSWER
x = torch.randn_like(x_0[:10]).to(device)
x, x_hist = sample_from_model_conditional(x, model_conditional, beta, label=label)
### END ANSWER
x_per_class.append(x.detach().cpu().numpy())
plt.figure(figsize=(10, 10))
for i in range(10):
for j in range(10):
plt.subplot(10, 10, j * 10 + i + 1)
plt.imshow(x_per_class[j][i, 0], cmap='gray')
plt.axis('off')
12.8 (h) Compare the output of the conditional model with that of the unconditional model. Which one is better? (1 point)
Answer:
It depends on your results, but you might expect that the conditional model is slightly easier to train: it can concentrate on a specific digit, and might generate better images.
Arguably, it is not fair to directly compare the models, because the conditional model requires the input y, while the unconditional model does not.
Grading:
- 1 point for a reasonable answer that matches your results and observations.
12.9 Discussion (6 points)¶
12.9 (a) Compare the sources of randomness in our diffusion model with those in the variational autoencoder and the GAN from earlier assignments. What are the main differences? (1 point)
Answer:
In the variational autoencoder and the GAN, the random input is concentrated in one place: the latent representation of the VAE, or the input of the GAN. In a diffusion model, the randomness is spread over all time steps.
Grading:
- 1 point for a correct answer.
- The answer should compare where randomness is used as an input.
- 0 points for comparing other aspects of GAN/VAE and diffusion models.
12.9 (b) Would you be able to train a good digit classification model on the initial input to the sampling function? Why, or why not? (1 point)
Hint: for variational autoencoders, normalizing flows, and GANs, there is a clear link between the input (a latent feature vector) and the output of the decoder. How does this work for our diffusion model?
Answer:
No, the initial input is not a good representation for a classification model. First, it is difficult to extract the label from the noisy 32x32-pixel input. Second, the output of the diffusion model depends on the intermediate noise as well. In 12.7 (a), we saw that the same initial input can lead to any digit, so it would be impossible to predict the output based on the initial noise.
Grading:
- 1 point for explaining why this will not work.
12.9 (c) When loading the data, we normalized the image intensities to [-1, +1], instead of [0, 1] or [0, 255]. Why is this a good input range for this diffusion model? (1 point)
Answer:
In the diffusion process, we add noise sampled from a standard normal distribution $\mathcal{N}(0, 1)$ and mix it with the pixel values. This works best if the noise and the pixel values have a similar range.
Grading:
- 1 point for a correct answer.
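A quick sanity check (a sketch, not part of the assignment) illustrates the scale mismatch: unit-variance Gaussian noise barely perturbs raw [0, 255] intensities, but is on the same scale as values normalized to [-1, +1].

```python
import torch

torch.manual_seed(0)
x_raw = torch.rand(10000) * 255       # raw intensities in [0, 255]
x_norm = x_raw / 127.5 - 1.0          # normalized to [-1, +1]
noise = torch.randn(10000)            # unit-variance Gaussian noise

# Ratio of noise std to signal std
print((noise.std() / x_raw.std()).item())   # ~0.014: noise is negligible
print((noise.std() / x_norm.std()).item())  # ~1.7: noise and signal comparable
```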
12.9 (d) In this assignment, we use a $\beta$ schedule with a small $\beta = 10^{-4}$ at the initial time steps ($t=0$) and a larger $\beta = 0.02$ at the end ($t=T$). Why is it useful to choose an increasing $\beta$? (1 point)
Answer:
The forward diffusion process starts with a noiseless image at $t=0$ and ends with a completely noisy image at $t=T$. At these later steps, we can afford, and need, more randomness to reach images that follow the standard normal distribution. At the early steps, which the sampler visits last, we need smaller noise and more precision, so that the final adjustments produce clean output images.
Grading:
- 1 point for a correct answer.
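To see the effect of the increasing schedule, we can compute the cumulative signal fraction $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$ for the linear schedule used in this assignment (a small illustrative sketch):

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)   # linear schedule from 1e-4 to 0.02
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

# Early steps preserve almost all of the signal...
print(alpha_bar[0].item())    # ~0.9999
# ...while by t = T the signal fraction is essentially zero, so x_T ~ N(0, I)
print(alpha_bar[-1].item())   # ~4e-5
```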
12.9 (e) What would happen if we made $\beta$ very small? (1 point)
Answer:
If $\beta$ is very small, we either need a very large number of steps $T$, or we will not obtain a sufficiently noisy image. This invalidates the assumption that the final image $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
Grading:
- 1 point for a correct answer.
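A quick numeric check of this answer, assuming a hypothetical constant $\beta = 10^{-4}$ and the same $T = 1000$:

```python
import torch

T = 1000
beta_small = torch.full((T,), 1e-4)    # hypothetical constant small beta
alpha_bar_T = torch.cumprod(1.0 - beta_small, dim=0)[-1]

# x_T would still retain ~90% of the clean signal, so it is far from N(0, I)
print(alpha_bar_T.item())  # ~0.905
```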
12.9 (f) What would happen if we made $\beta$ very large? (1 point)
Answer:
If $\beta$ is very large, we would very quickly reach a random image $\mathbf{x}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and would not need a large number of steps $T$. However, each step would add a lot of noise, which would make it very difficult for our model to produce good estimates.
Grading:
- 1 point for a correct answer.
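Conversely, with a hypothetical constant $\beta = 0.5$, the forward process destroys the signal in a handful of steps, but every reverse sampling step then injects noise with $\sigma = \sqrt{0.5} \approx 0.71$, swamping the model's denoising corrections:

```python
import torch

beta_large = torch.full((10,), 0.5)    # hypothetical constant large beta
alpha_bar = torch.cumprod(1.0 - beta_large, dim=0)

print(alpha_bar[-1].item())              # 0.5**10 ≈ 0.001: signal gone in 10 steps
print(torch.sqrt(beta_large[0]).item())  # ≈ 0.71: large per-step sampling noise
```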
The end¶
Well done! Please double check the instructions at the top before you submit your results.
This assignment has 42 points. Version 70d23d4 / 2024-12-03